import sys
import sumr
from importlib import reload  # reload is no longer a builtin in Python 3
reload(sumr)
base_path = 'quarter_example/'
import os
import itertools
from bs4 import BeautifulSoup
As an example, we're going to create a single-document summary of AIG's 2009 first-quarter earnings announcement. The full text of the announcement can be found here.
As you can see, most of the pertinent information is concentrated in the first 10 pages of the document. Digging deeper, the main highlights can be found in the first paragraph:
American International Group, Inc. (AIG) today reported a net loss for the first quarter of 2009 of \$4.35 billion or \$1.98 per diluted share, compared to a net loss of \$7.81 billion or \$3.09 per diluted share in the first quarter of 2008. First quarter 2009 adjusted net loss, excluding net realized capital gains (losses) and FAS 133 gains (losses), net of tax, was \$1.60 billion, compared to an adjusted net loss of \$3.56 billion in the first quarter of 2008.
These are the headline numbers that most analysts and investors pay attention to: the paragraph contains the earnings figure and a comparison with the same quarter a year earlier. For that reason, each of the algorithms below is designed to return the first sentence of the first paragraph of each earnings announcement. However, I do not show this sentence in the results; a separate function call naively returns it for each document.
The traditional measure of summarization quality is Recall-Oriented Understudy for Gisting Evaluation (ROUGE). This metric measures the overlap in word sequences (n-grams, for some window size N) between the automatically generated summary and one that was written manually by a human.
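To make the overlap idea concrete, here is a minimal sketch of ROUGE-N recall. It is not part of `sumr`: the function names `ngrams` and `rouge_n_recall` are hypothetical, tokenization is a plain lowercase whitespace split, and the two example strings are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of the reference's n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: an n-gram counts at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return float(overlap) / total


# Made-up strings, only to exercise the function.
reference = "aig reported a net loss for the first quarter"
candidate = "aig reported a first quarter net loss"
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```

Unigram recall rewards a summary for using the same words as the human summary, while larger values of `n` also reward it for keeping those words in the same order.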
For simplicity's sake, I haven't generated a ROUGE score because we're only generating 6 key sentences for each of our documents. I start with the ideal summary and then move on to generating single-document summaries using each of the three summarization algorithms.
American International Group, Inc. (AIG) today reported a net loss for the first quarter of 2009 of \$4.35 billion or \$1.98 per diluted share, compared to a net loss of \$7.81 billion or \$3.09 per diluted share in the first quarter of 2008.
AIG reported a \$1.9 billion pre-tax (\$1.2 billion after tax) charge for restructuring costs, primarily related to the wind down of AIG Financial Products Corp., AIG Trading Group, Inc. and their subsidiaries (collectively, AIGFP) and other.
AIG reported market disruption-related losses of \$2.5 billion pre-tax (\$1.6 billion after tax).
The Federal Reserve Bank of New York (FRBNY) Credit Agreement was amended to remove the minimum 3.5 percent LIBOR floor as of April 17, 2009.
The stabilization of rates is an improvement from the fourth quarter of 2008 and reflects the current market conditions.
The foreign exchange effect for the first quarter of 2009 was a reduction of reserves of \$290 million.
doc_list = os.listdir(base_path)
CleanText = sumr.TextCleaner(doc_list)
CleanText.read_docs(base_path = base_path)
tf_idf = CleanText.make_tf_idf()
NaiveSum = sumr.NaiveSumr(CleanText)
I weight each sentence as a function of the above features. Compared to our baseline, this recovers 3 of the 5 manually generated summary sentences.
NaiveSum.Summarize(doc_list[0])
Specifically, we update each node's score according to the following formula:
$WS(N_{i}) = (1-\alpha) + \alpha * \sum_{N_{j} \in In(N_{i})} \frac{w_{ji}}{\sum_{N_{k} \in Out(N_{j})} w_{jk}}WS(N_{j})$
Here, $N$ denotes a node, $In(N)$ is the set of nodes with directed edges into node $N$, $Out(N)$ is the set of nodes with directed edges coming from node $N$, $\alpha$ is a damping parameter that accounts for the "jump" probability between nodes, and $w_{ji}$ is the weight of the edge from node $N_j$ to node $N_i$.
I set $\alpha$ to 0.7 and use Levenshtein distance as the weighting between nodes. Nodes are complete sentences within the document.
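The update rule can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code inside `sumr.SumrGraph`: the helper names (`levenshtein`, `similarity_matrix`, `textrank_scores`) are hypothetical, and because how the distance is turned into an edge weight is not shown here, the sketch assumes a similarity of 1 minus the distance divided by the longer sentence's length. The three sentences are shortened, made-up examples.

```python
import numpy as np


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def similarity_matrix(sentences):
    """Assumed weighting: 1 - distance / longer length, with a zero diagonal."""
    n = len(sentences)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            longest = max(len(sentences[i]), len(sentences[j]), 1)
            dist = levenshtein(sentences[i], sentences[j])
            W[i, j] = W[j, i] = 1.0 - dist / float(longest)
    return W


def textrank_scores(W, alpha=0.7, tol=1e-6, max_iter=100):
    """Iterate WS(N_i) = (1 - alpha) + alpha * sum_j (w_ji / sum_k w_jk) * WS(N_j)."""
    out_weight = W.sum(axis=1)         # sum_k w_jk for each source node j
    out_weight[out_weight == 0] = 1.0  # avoid dividing by zero for isolated nodes
    P = W / out_weight[:, None]        # P[j, i] = w_ji / sum_k w_jk
    scores = np.ones(W.shape[0])
    for _ in range(max_iter):
        updated = (1 - alpha) + alpha * P.T.dot(scores)
        if np.abs(updated - scores).max() < tol:
            return updated
        scores = updated
    return scores


# Made-up sentences, only to exercise the functions.
sentences = [
    "AIG reported a net loss for the first quarter.",
    "AIG reported a net loss for the quarter.",
    "The foreign exchange effect reduced reserves.",
]
scores = textrank_scores(similarity_matrix(sentences))
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(round(float(score), 3), sentence)
```

Sentences that are close to many other sentences accumulate score from their neighbours, so the two near-duplicate sentences rank above the unrelated one.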
This recovers 2 of the 5 manually generated summary sentences.
TextRankSum = sumr.SumrGraph(CleanText)
TextRankSum.Summarize(doc_list[0])
The steps of the algorithm are:
Note:
Recall that SVD decomposes $A = U \Sigma V^T$, where $\Sigma$ is a diagonal matrix and $U$ and $V^T$ are orthogonal matrices. Therefore, $AA^T = U \Sigma V^TV \Sigma^T U^T = U \Sigma^2U^T$ and $A^TA = V \Sigma^TU^TU \Sigma V^T = V \Sigma^2V^T$. $\Sigma^2$ is a diagonal matrix, so it must be the case that $V$ contains the eigenvectors of $A^TA$ and $U$ contains the eigenvectors of $AA^T$.
The elements of $A^TA$ contain the dot products of each term across the document against the individual terms in each sentence. We can think of $A^TA$ as the "term covariance matrix", so that projecting the matrix onto a lower-dimensional space finds the "latent" terms or topics that underlie the document.
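The identities in the note are easy to check numerically. The sketch below uses a small made-up count matrix; its values and its orientation (rows as terms, columns as sentences) are assumptions for illustration only, and the identities hold for any real matrix regardless of orientation.

```python
import numpy as np

# Toy count matrix with made-up values (assumed layout: rows are terms, columns are sentences).
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [1., 0., 2.]])

# Reduced SVD: U is 4x3, s holds the 3 singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T
sigma_sq = np.diag(s ** 2)

# A^T A = V Sigma^2 V^T and A A^T = U Sigma^2 U^T
print(np.allclose(A.T.dot(A), V.dot(sigma_sq).dot(V.T)))
print(np.allclose(A.dot(A.T), U.dot(sigma_sq).dot(U.T)))

# Each column of V is an eigenvector of A^T A with eigenvalue sigma^2.
for sigma, v in zip(s, V.T):
    print(np.allclose(A.T.dot(A).dot(v), sigma ** 2 * v))
```

Because the singular values come back sorted in decreasing order, keeping only the first few columns of $U$ and $V$ is what gives the lower-dimensional projection.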
This recovers 1 of the 5 manually generated summary sentences.
lsasumr = sumr.LSASumr(CleanText)
lsasumr.make_term_sentence_matrix(doc_list[0])
lsasumr.get_singular_vector()
lsasumr.Summarize(doc_list[0])